---
title: "R Notebook"
output: html_notebook
---
Exploratory Data Analysis (EDA)
- Generate questions about your data.
- Search for answers by visualising, transforming, and modelling your data.
- Use what you learn to refine your questions and/or generate new questions.
```{r}
library(tidyverse)
```
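Two types of questions are always useful for making discoveries within your data. You can loosely word them as:

1. What type of variation occurs within my variables?
2. What type of covariation occurs between my variables?

To make the discussion easier, define some terms: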

- A variable is a quantity, quality, or property that you can measure.
- A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
- An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.
- Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
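
As a rough illustration of these terms, here is a tiny made-up table (the stones and their numbers are invented, not taken from `diamonds`): each row is one observation, each column one variable, and each cell one value.

```{r}
# Hypothetical measurements of three stones; all values are invented
tidy_demo <- data.frame(
  stone = c("a", "b", "c"),
  carat = c(0.23, 0.31, 1.01),
  price = c(326, 335, 4500)
)

nrow(tidy_demo)      # number of observations
ncol(tidy_demo)      # number of variables
tidy_demo$carat[2]   # one value: the carat of the second observation
```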

Variation is the tendency of the values of a variable to change from measurement to measurement. Every variable has its own pattern of variation, and the best way to understand that pattern is to visualise the distribution of the variable’s values.
```{r}
diamonds
```

To examine the distribution of a categorical variable, use a bar chart:
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```
```{r}
diamonds %>% 
  count(cut)
```
To examine the distribution of a continuous variable, use a histogram:
```{r}
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
```
```{r}
diamonds %>% 
  count(cut_width(carat, 0.5))
```
```{r}
smaller <- diamonds %>% 
  filter(carat < 3)
  
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
```
```{r}
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)
```
Some questions that could be asked:
- Which values are the most common? Why?
- Which values are rare? Why? Does that match your expectations?
- Can you see any unusual patterns? What might explain them?
```{r}
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)
```
As an example, the histogram above suggests several interesting questions:
- Why are there more diamonds at whole carats and common fractions of carats?
- Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
- Why are there no diamonds bigger than 3 carats?

Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:
- How are the observations within each cluster similar to each other?
- How are the observations in separate clusters different from each other?
- How can you explain or describe the clusters?
- Why might the appearance of clusters be misleading?

```{r}
ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_histogram(binwidth = 0.25)
```
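
One rough way to start describing the two clusters is to split the eruptions at a cut-off and summarise each group. The 3-minute threshold below is an assumption picked by eye from the histogram, not a value given in these notes.

```{r}
# Split Old Faithful eruptions at an assumed, eyeballed cut-off of 3 minutes
eruption_group <- ifelse(faithful$eruptions < 3, "short", "long")

table(eruption_group)                              # size of each cluster
tapply(faithful$eruptions, eruption_group, mean)   # mean length per cluster
tapply(faithful$waiting, eruption_group, mean)     # mean wait per cluster
```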

Discover outliers
```{r}
ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)
```
Zoom in for outliers:
```{r}
ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
```
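
`coord_cartesian()` zooms without discarding data. ggplot2's `xlim()` and `ylim()` behave differently: they drop values outside the limits, so bars taller than the limit vanish instead of being cropped. A minimal sketch of the contrast (the object names are only for illustration):

```{r}
library(ggplot2)

base_hist <- ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5)

# Zoom: every bar is still computed and drawn, tall ones are cropped
zoomed <- base_hist + coord_cartesian(ylim = c(0, 50))

# Limits on the scale: bars with a count above 50 are removed (with a warning)
clipped <- base_hist + ylim(0, 50)

zoomed
clipped
```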
```{r}
(unusual <- diamonds %>% 
  filter(y < 3 | y > 20) %>% 
  select(price, x, y, z) %>%
  arrange(y))
```
These values are unreasonable, probably data-entry errors: a diamond cannot have a width of 0 mm.

## Exercise P90
### 1
LOOK AT THE DATA FIRST!!!
```{r}
?diamonds
```

```{r}
ggplot(data = diamonds, mapping = aes(x = x)) + 
  geom_histogram(binwidth = 0.25)

ggplot(data = diamonds, mapping = aes(x = y)) + 
  geom_histogram(binwidth = 0.25)

ggplot(data = diamonds, mapping = aes(x = z)) + 
  geom_histogram(binwidth = 0.25)
  
```
All three distributions have peaks. The ranges are ordered `x < z < y`; the ranges of `y` and `z` are stretched by outliers.

```{r}
library(ggplot2)
ggplot(data = diamonds) + 
  xlim(c(0,10))+
  geom_histogram(binwidth = 0.25, mapping = aes(x = x), fill= "green") +
  geom_histogram(binwidth = 0.25, mapping = aes(x = y), fill = "yellow") +
  geom_histogram(binwidth = 0.25, mapping = aes(x = z), fill = "blue")
```


```{r}
xyz_diamonds <- filter(diamonds, x<20, y<20, z<20)
ggplot(data = xyz_diamonds) +
  geom_freqpoly(binwidth = 0.2, mapping = aes(x = x), color = "blue") +
  geom_freqpoly(binwidth = 0.2, mapping = aes(x = y), color = "green") +
  geom_freqpoly(binwidth = 0.2, mapping = aes(x = z), color = "orange")
```
### 2
```{r}
ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_histogram(binwidth = 100)

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_histogram(binwidth = 500)

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_histogram(binwidth = 1000)

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_histogram(binwidth = 2000)
```
```{r}
diamonds_price <- filter(diamonds, price<2000)
ggplot(data = diamonds_price, mapping = aes(x = price)) + 
  geom_histogram(binwidth = 50)
```
There is a gap at the price around 1500.

### 3
```{r}
(diamonds_99<- filter(diamonds, carat == 0.99))
(diamonds_1<- filter(diamonds, carat == 1))

(diamonds_carat <- filter(diamonds, carat == 0.99|carat == 1))
diamonds_carat %>%
  count(cut_width(carat, 0.01))
```
There are far more 1-carat diamonds than 0.99-carat ones, so people seem to aim for diamonds of at least 1 carat. To check why, look at price vs. carat.

### 4
```{r}
ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) 

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(xlim = c(0, 10))

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y)) +
  coord_cartesian(xlim = c(0, 10))
# after zooming to an interval, the default number of bins is not a good choice
ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.2) +
  coord_cartesian(xlim = c(0, 10))


ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y)) +
  coord_cartesian(xlim = c(3.5, 4.5))
```
```{r}
ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_histogram(binwidth = 100)

ggplot(data = diamonds, mapping = aes(x = price))+ 
  geom_histogram()

ggplot(data = diamonds, mapping = aes(x = x))+ 
  geom_histogram()
```
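Does R automatically pick a value so that there are 30 bins? Yes: when neither `bins` nor `binwidth` is set, `geom_histogram()` uses 30 bins and prints a message asking you to pick a better value with `binwidth`.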
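
To see why the default of 30 bins can be a poor fit after zooming: the bins span the full range of the data, not the visible window. A sketch of the arithmetic, using `bw_default` as an illustrative name:

```{r}
library(ggplot2)

# Rough width of each bin when 30 bins cover the whole range of y
# (ggplot2's exact rule differs slightly)
bw_default <- diff(range(diamonds$y)) / 30
bw_default

# A window from 3.5 to 4.5 is narrower than a single default bin,
# so choose a binwidth that suits the window instead
ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.05) +
  coord_cartesian(xlim = c(3.5, 4.5))
```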


For unusual values, replace them with missing values (`NA`):
```{r}
diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
diamonds2
```
ifelse() has three arguments. The first argument test should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is false.
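
A small sketch of that behaviour on a made-up vector (`widths` is a hypothetical name, chosen so it does not clash with the `y` column):

```{r}
# Invented widths in mm; two of them are implausible
widths <- c(2.5, 4.1, 58.9, 6.0)

# test is TRUE for the implausible values, so they become NA;
# everything else keeps its original value
cleaned <- ifelse(widths < 3 | widths > 20, NA, widths)
cleaned
```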

```{r}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point()
# it gives a warning about removed rows that contain missing values
```
```{r}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point(na.rm = TRUE)
```
```{r}
nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(sched_dep_time)) + 
    geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
```
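
The `%/%` and `%%` steps convert a clock time stored as an integer such as 1345 into decimal hours. A sketch with three made-up times:

```{r}
# Made-up scheduled departure times in HHMM form
hhmm <- c(515, 1345, 2359)

hours   <- hhmm %/% 100   # integer division keeps the hour: 5, 13, 23
minutes <- hhmm %% 100    # remainder keeps the minutes: 15, 45, 59

decimal_hours <- hours + minutes / 60
decimal_hours
```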

## Exercise P93
### 1
```{r}
ggplot(data = diamonds2, mapping = aes(x = y)) + 
  geom_bar()
```
```{r}
ggplot(data = diamonds2, mapping = aes(x = y)) + 
  geom_bar(na.rm = TRUE)
```

```{r}
ggplot(data = diamonds2, mapping = aes(x = y)) + 
  geom_histogram()
```
```{r}
ggplot(data = diamonds2, mapping = aes(x = y)) + 
  geom_histogram(na.rm = TRUE)
```
### 2
```{r}
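# use the y column of diamonds2, which contains NA after the replacement above
mean(diamonds2$y)
mean(diamonds2$y, na.rm = TRUE)

sum(diamonds2$y)
sum(diamonds2$y, na.rm = TRUE)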
```
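
A sketch of what `na.rm = TRUE` does, on a made-up vector (`vals` is a hypothetical name): missing values are dropped before the summary is computed, otherwise a single `NA` makes the whole result `NA`.

```{r}
vals <- c(1, 2, NA, 4)

mean(vals)                 # NA: the missing value propagates
mean(vals, na.rm = TRUE)   # mean of 1, 2 and 4

sum(vals)                  # NA
sum(vals, na.rm = TRUE)    # 1 + 2 + 4 = 7
```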


Covariation is the tendency for the values of two or more variables to vary together in a related way. 

It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable. The default appearance of geom_freqpoly() is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in shape. 
```{r}
ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

# the overall counts differ a lot between cuts
ggplot(diamonds) + 
  geom_bar(mapping = aes(x = cut))
```
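To make the comparison easier, display density (the count standardised so that the area under each frequency polygon is one) instead of count: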
```{r}
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```
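
Density here is the count rescaled so that the area under each group's polygon is one. A hand-rolled sketch of that rescaling, assuming bins of width 500 that start at 0 (`geom_freqpoly()` may align its bins differently, so the numbers need not match the plot exactly):

```{r}
library(dplyr)
library(ggplot2)

bin_width <- 500

price_density <- diamonds %>%
  count(cut, bin = cut_width(price, bin_width, boundary = 0)) %>%
  group_by(cut) %>%
  mutate(density = n / (sum(n) * bin_width)) %>%
  ungroup()

# Area per cut: every group sums to 1, however many diamonds it holds
price_density %>%
  group_by(cut) %>%
  summarise(area = sum(density * bin_width))
```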
```{r}
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()
```
```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
```
Reorder `class` by the median value of `hwy` to make the trend easier to see:
```{r}
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
```
```{r}
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
```
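
`reorder()` sorts the levels of a factor by a summary of another variable. A sketch on a made-up data frame (`mpg_demo` and its values are invented):

```{r}
mpg_demo <- data.frame(
  class = c("suv", "suv", "compact", "compact", "pickup", "pickup"),
  hwy   = c(20, 18, 30, 28, 17, 15)
)

# Default order is alphabetical
levels(factor(mpg_demo$class))

# Ordered by median hwy, lowest first
levels(reorder(mpg_demo$class, mpg_demo$hwy, FUN = median))
```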
## Exercise P90
### 1
```{r}
nycflights13::flights
```
```{r}
nycflights13::flights %>% 
  mutate(
    Cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(x = sched_dep_time, y = ..density..)) + 
    geom_freqpoly(mapping = aes(colour = Cancelled), binwidth = 1/4) +
    ylab("Density") +
    xlab("Scheduled departure time (hour)")
```
### 2
```{r}
diamonds
?diamonds
```
```{r}
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth()

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()

ggplot(data = diamonds, mapping = aes(x = color, y = price)) +
  geom_boxplot()

ggplot(data = diamonds, mapping = aes(x = clarity, y = price)) +
  geom_boxplot()

ggplot(data = diamonds, mapping = aes(x = depth, y = price)) +
  geom_point() +
  geom_smooth()

ggplot(data = diamonds, mapping = aes(x = table, y = price)) +
  geom_point() +
  geom_smooth()

ggplot(data = xyz_diamonds, mapping = aes(y = price)) +
  geom_point(mapping = aes(x = x), color = "green", alpha = 0.1) +
  geom_point(mapping = aes(x = y), color = "yellow", alpha = 0.1) +
  geom_point(mapping = aes(x = z), color = "black", alpha = 0.1)
```
There is a correlation between price and both carat and size (`x`, `y`, `z`). Of all the variables plotted above, carat shows the clearest relationship with price, so it looks like the most important predictor.
```{r}
ggplot(data = diamonds, mapping = aes(x = carat, y = price, color = cut)) +
  geom_point()
```
Diamonds with a fair cut but a large carat still fetch a very high price. There are few diamonds with an ideal cut at large carats, and in that range carat and price do not seem to be correlated.

### 3
```{r}
library(ggstance)
```
```{r}
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy)) +
  coord_flip()

ggplot(data = mpg) +
  geom_boxploth(mapping = aes(x = hwy, y = class))
```
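The two plots look exactly the same. With `geom_boxploth()` the variables are mapped directly to the axes they appear on (`x = hwy`, `y = class`), so no `coord_flip()` is needed.

### 4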
```{r}
library(lvplot)
```

```{r}
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_lv()
```
`geom_lv()` failed here with the error "GeomLv was built with an incompatible version of ggproto"; reinstalling lvplot, as the message suggests, should fix it.

### 5
```{r}
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) + 
  geom_violin()

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_histogram() +
  facet_wrap(~cut)

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```
- Violin: see the trend of each category clearly.
- Histogram with facets: can also compare the number of diamonds.
- Frequency polygon: compare shapes directly because they overlap.

See also the ggbeeswarm package, used in the next exercise.

### 6
```{r}
library(ggbeeswarm)
```
```{r}
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) + 
  geom_point()

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) + 
  geom_jitter()
```
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_beeswarm(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_beeswarm(mapping = aes(x = displ, y = hwy), groupOnX = FALSE)
```
## Two categorical variables
```{r}
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
```
```{r}
diamonds %>% 
  count(color, cut)
# one row for every combination of color and cut
```
```{r}
diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))
```

## Exercise P101
### 1
```{r}
library(viridis)
```
```{r}
diamonds %>% 
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop))  +
  scale_fill_viridis(limits = c(0, 1))
```
### 2
```{r}
nycflights13::flights
```
```{r}
nycflights13::flights %>% 
  group_by(dest, month) %>%  
  summarise(avg_delay = mean(dep_delay, na.rm = T)) %>%  
  ggplot(mapping = aes(x = dest, y = month)) +
    geom_tile(mapping = aes(fill = avg_delay))
```
The destination labels overlap, and there are missing values (not every destination has flights in every month).

```{r}
nycflights13::flights %>% 
  group_by(dest) %>%  
  summarise(avg_delay = mean(dep_delay, na.rm = T))
```
```{r}
nycflights13::flights %>% 
  # %in% keeps every flight to the destinations found in rows 10 to 20;
  # == would recycle the short vector and compare row by row
  filter(dest %in% dest[10:20]) %>%
  group_by(dest, month) %>%  
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%  
  ggplot(mapping = aes(x = factor(month), y = dest)) +
    geom_tile(mapping = aes(fill = avg_delay)) +
    labs(x = "Month", y = "Destination", fill = "Departure Delay")
```
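
When filtering on several values, note that `==` compares element by element and recycles the shorter vector, while `%in%` tests membership in a set. A sketch with made-up destination codes:

```{r}
dest_demo <- c("IAH", "MIA", "BQN", "ATL", "ORD", "MIA")
wanted    <- c("MIA", "ORD")

# wanted is recycled to MIA, ORD, MIA, ORD, MIA, ORD and compared
# position by position, so matches depend on row order
dest_demo == wanted

# membership test: TRUE wherever the destination is in the set
dest_demo %in% wanted
```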
### 3
```{r}
diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = cut, y = color)) +
    geom_tile(mapping = aes(fill = n))

diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))
```
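Possibly because the levels of cut are easier to compare than the colors, so putting cut on the y-axis is easier to understand. The longer cut labels are also easier to read there.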

## Two continuous variables
```{r}
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))
```
The points overlap a lot, so add transparency:
```{r}
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
```
```{r}
library(hexbin)
```
```{r}
ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))

ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))
```
```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
```
Make the width of each boxplot proportional to the number of points:
```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)), varwidth = TRUE)
```
Display approximately the same number of points in each bin.
```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
```
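
The difference between the two helpers is easiest to see from the bin counts: `cut_width()` fixes the width of each bin, `cut_number()` fixes (approximately) how many points each bin holds. A sketch on the carats below 3 (`carat_small` is an illustrative name):

```{r}
library(ggplot2)

carat_small <- diamonds$carat[diamonds$carat < 3]

# Equal-width bins: counts vary a lot
table(cut_width(carat_small, 0.5))

# Ten bins with roughly equal counts: widths vary instead
table(cut_number(carat_small, 10))
```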

## Exercise P104
### 1
```{r}
ggplot(data = smaller, mapping = aes(x = price,y = ..density..)) + 
  geom_freqpoly(mapping = aes(color = cut_number(carat, 10)))

ggplot(data = smaller, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(color = cut_width(carat, 0.5)))
```
The two methods yield very different graphs. With `cut_number()` each group holds roughly the same number of diamonds, so its count graph looks the same as its density graph.

### 2
```{r}
ggplot(data = smaller, mapping = aes(x = price, y = carat)) + 
  geom_boxplot(mapping = aes(group = cut_number(price, 20)))

ggplot(data = smaller, mapping = aes(x = price, y = carat)) + 
  geom_boxplot(mapping = aes(group = cut_width(price, 1000,boundary = 0)))
```
### 3
It seems that when the price is high, it no longer correlates with size. Maybe some other factors have an influence.

### 4
```{r}
diamonds %>% 
  #filter(carat<3) %>% 
  ggplot(mapping = aes(x = carat, y = cut)) +
    geom_tile(mapping = aes(fill = price))
```
Price correlates with carat more than with cut.
```{r}
diamonds %>% 
  #filter(carat<3) %>% 
  ggplot(mapping = aes(x = carat, y = price)) +
  geom_point(alpha = 0.5) +
    facet_wrap(~cut)
```
There are more large diamonds with a fair cut. Other than that, there is not much difference between the cuts.

### 5
```{r}
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
```
A scatterplot is better here because outliers are not shown in binned plots.

## Patterns and models
If you spot a pattern, ask yourself:

- Could this pattern be due to coincidence (i.e. random chance)?
- How can you describe the relationship implied by the pattern?
- How strong is the relationship implied by the pattern?
- What other variables might affect the relationship?
- Does the relationship change if you look at individual subgroups of the data?

```{r}
ggplot(data = faithful) + 
  geom_point(mapping = aes(x = eruptions, y = waiting))
```
If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it.
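
One way to make this concrete, as a sketch rather than a formal analysis: compare the spread of `waiting` on its own with the spread left over after a straight-line fit on `eruptions` (a linear model is an assumption here; `faithful_mod` is an illustrative name).

```{r}
# Spread of waiting times with no other information
sd(faithful$waiting)

# Spread that remains once eruption length is taken into account
faithful_mod <- lm(waiting ~ eruptions, data = faithful)
sd(resid(faithful_mod))
```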

```{r}
library(modelr)

mod <- lm(log(price) ~ log(carat), data = diamonds)

diamonds2 <- diamonds %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))

ggplot(data = diamonds2) + 
  geom_point(mapping = aes(x = carat, y = resid))
```
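
Because the model is fitted on the log scale, `exp(resid)` is the ratio of the actual price to the price predicted from carat alone: values above 1 mean a diamond is pricier than its size suggests. A sketch checking that identity with base R (`check_mod` and `price_ratio` are illustrative names):

```{r}
library(ggplot2)  # for the diamonds data

check_mod <- lm(log(price) ~ log(carat), data = diamonds)

# Actual price divided by the price the model predicts
price_ratio <- diamonds$price / exp(fitted(check_mod))

# Same numbers as the exponentiated residuals
all.equal(unname(exp(resid(check_mod))), unname(price_ratio))
```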

Relative to their size, better quality diamonds are more expensive.
```{r}
ggplot(data = diamonds2) + 
  geom_boxplot(mapping = aes(x = cut, y = resid))
```
```{r}
ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_freqpoly(binwidth = 0.25)

ggplot(faithful, aes(eruptions)) + 
  geom_freqpoly(binwidth = 0.25)
```
http://www.cookbook-r.com/Graphs/

